Abstract

In this paper, a novel framework based on trace norm minimization for audio segment
classification is proposed. In this framework, both feature extraction and classification are
performed by solving convex optimization problems with trace norm regularization. For
feature extraction, robust principal component analysis (robust PCA), which minimizes a
combination of the nuclear norm and the $\ell_1$-norm, is used to extract low-rank features that
are robust to white noise and gross corruption of audio segments. These low-rank features
are fed to a linear classifier whose weight and bias are learned by solving similar trace
norm constrained problems. For this classifier, most methods find the weight and bias in
batch-mode learning, which makes them inefficient for large-scale problems. In this paper,
we propose an online framework using the accelerated proximal gradient method. The main
advantage of this framework is its low memory cost. In addition, as a result of the regularization
formulation of matrix classification, the Lipschitz constant is given explicitly, and hence
the step size estimation of the general proximal gradient method is omitted in our approach.
Experiments on real data sets for laugh/non-laugh and applause/non-applause classification
indicate that this novel framework is effective and noise robust.

1 Introduction 

Audio feature extraction and classification methods have been studied by many researchers over
the years [1][3][2]. In general, audio classification is performed in two steps: reducing
the audio signal to a small set of parameters using various feature extraction techniques, and
classifying or categorizing over these parameters. Features commonly exploited for audio
classification can be roughly grouped into time domain features, transformation domain
features, time-transformation domain features, or their combinations [HIS]. Many of these
features are common to audio signal processing and speech recognition and have performed
successfully in various applications. However, almost all of these features are based on short
time durations and are in vector form (which is easy to handle but not always appropriate),
although it is believed that long time durations (seconds) help considerably in decision making.
In this work we build robust features over a long time duration in matrix form, which is the
most natural way to use long-time audio information.

In order to map or smooth the audio segment into a robust matrix space, we introduce the
trace norm regularization technique to audio signal processing. Trace norm regularization
is a principled approach to learning low-rank matrices through convex optimization problems [7].
Similar problems arise in many machine learning tasks such as matrix completion [5],
multi-task learning [S], robust principal component analysis (robust PCA) [10][11], and matrix
classification [12]. In this paper, robust PCA is used to extract matrix representation features
for audio segments. Unlike traditional frame-based vector features, these matrix features are
extracted from sequences of audio frames. It is believed that over a short duration the signals
are generated by a few factors. Thus it is natural to approximate the frame sequence by
low-rank features using robust PCA, which assumes that the observed matrices are combinations
of low-rank matrices and corruption noise matrices.

Having extracted descriptive features, various machine learning methods are used to provide
a final classification of the audio events, such as rule-based approaches, Gaussian mixture models,
support vector machines, and Bayesian networks [4][5][6]. In most previous work, these two
steps of audio classification are separate and independent. In this work, the classifiers are
learned by solving similar optimization problems using trace norm regularization. After
extraction of the robust low-rank matrix features, the regularization-based matrix
classification approach proposed by Tomioka and Aihara in [12] is used to predict the label.

The problem of matrix classification (MC) with spectral regularization was first proposed by
Tomioka and Aihara in [12]. The goal is to infer the weight matrix and bias under a
low trace norm constraint and a low deviation of the empirical statistics from their predictions.
The trace norm is used to measure the complexity of the weight matrix of the linear classifier
for matrix classification. This kind of inference task belongs to the more general problem of
learning a low-rank matrix through convex optimization. Since matrix rank minimization is
NP-hard in general due to the combinatorial nature of the rank function, a commonly used
convex relaxation of the rank function is the trace norm (nuclear norm) [7], defined as the sum
of the singular values of the matrix.

Recent related research has not focused on matrix classification directly, but rather on the
general trace norm minimization problem [13][14][15]. These general algorithms can be adapted to
matrix classification. Most of these methods are iterative batch procedures [13][14][15],
accessing the whole training set at each iteration in order to minimize a weighted sum of a cost
function and the trace norm. This kind of learning procedure cannot deal with very large training
sets, because the data may not fit into memory simultaneously. Furthermore, it cannot
be started until all the training data are prepared, and hence cannot effectively deal with
training data that arrive in sequence, as in audio and video processing.

To address these problems, we propose an online approach that processes the training samples
one at a time, or in mini-batches, to learn the weight matrix and the bias for matrix
classification. We transform the general batch-mode accelerated proximal gradient (APG) [13][14]
method for trace norm minimization into the online learning framework. In this online learning
framework, a slight modification of the exact APG leads to an inexact APG (IAPG) method,
which needs less computation per iteration than the exact APG. In addition, because matrix
classification is a special case of the general convex optimization problem, we derive the
Lipschitz constant in closed form, and hence the step size estimation [13][14] of the general
APG method is omitted in our approach.

Our main contributions in this work can be summarized as follows:

1. To the best of our knowledge, we are the first to introduce low-rank constraints in audio and
speech signal processing, and the results show that these constraints make the systems
more robust to noise, especially to large corruptions.
2. We propose online learning algorithms to learn the trace norm minimization based matrix
classifier, which make the approach practical in real applications.

The paper is organized as follows: Section 2 presents the extraction of matrix representation
features. Section 3 presents the solution of the matrix classification problem via the general
APG method and the proposed audio event detection with matrix classification. The proposed
online methods with exact and inexact APG for weight and bias learning are introduced in
Section 4. Section 5 is devoted to experimental results that demonstrate the characteristics and
merits of the proposed algorithm. Finally, we give some concluding remarks in Section 6.

2 Low-Rank Matrix Representation Features 

Over the past decades, a lot of work has been done on audio and speech features for audio and
speech processing [2][3][5]. Due to convenience and the short-time stationarity assumption, these
features are mainly in vector form based on frames, although it is believed that features based
on longer durations help considerably in decision making. In order to build long-term features,
the consecutive frame signals are stacked as rows, so that the audio segments become matrices.
Generally, it is assumed that the consecutive frame signals are influenced by a few
factors, so these matrices are combinations of low-rank components and noise. Hence it is
natural to approximate these matrices by low-rank matrices. In this work, transformations of
these approximate low-rank matrices are used as features.
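
As an illustration of how consecutive frames can be stacked as rows, the following is a minimal Python sketch. It is not the implementation used in this work (the experiments were run in Matlab); the function name `frames_to_matrix`, the use of non-overlapping frames, and the choice to drop an incomplete trailing frame are assumptions made for the example. The 8 kHz sampling rate and 20 ms frame length are the values reported in Section 5.

```python
import numpy as np

def frames_to_matrix(signal, frame_len):
    """Stack consecutive non-overlapping frames of a 1-D signal as matrix rows."""
    signal = np.asarray(signal, dtype=float)
    n_frames = signal.size // frame_len          # m: number of complete frames
    trimmed = signal[: n_frames * frame_len]     # assumption: drop the incomplete tail
    return trimmed.reshape(n_frames, frame_len)  # shape (m, n)

# One second of audio at 8 kHz, split into 20 ms frames of 160 samples each.
D = frames_to_matrix(np.arange(8000), frame_len=160)
print(D.shape)
```

Each row of `D` holds one frame, so `D` has one row per frame and one column per sample within a frame.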

Given an observed data matrix $D \in \mathbb{R}^{m \times n}$, where $m$ is the number of frames and $n$ is
the number of samples in a frame, it is assumed that it can be decomposed as

$$D = A + E, \tag{1}$$

where $A$ is the low-rank component and $E$ is the error or noise matrix. The purpose here is to
recover the low-rank component without knowing its rank. For this problem, PCA is a
suitable approach in that it can find the low-dimensional approximating subspace by forming a
low-rank approximation to the data matrix [16]. However, it breaks down under large corruption,
even if that corruption affects only very few of the observations, which is often encountered in
practice. To solve this problem, the following convex optimization formulation is proposed:

$$\min_{A,E} \; \|A\|_* + \lambda \|E\|_1, \quad \text{subject to } D = A + E, \tag{2}$$

where $\|\cdot\|_*$ denotes the trace norm of a matrix, which is defined as the sum of the singular
values, $\|\cdot\|_1$ denotes the sum of the absolute values of the matrix elements, and $\lambda$ is a positive
regularization parameter. This optimization is referred to as robust PCA in [10] for its ability
to exactly recover the underlying low-rank structure in data even in the presence of large errors or
outliers. In order to solve Equation (2), several algorithms have been proposed, among which
the augmented Lagrange multiplier method is the most efficient and accurate at present [11].
In our work, this robust PCA method is employed for the low-rank matrix extraction.

In order to apply the augmented Lagrange multiplier (ALM) method to the robust PCA problem,
Lin et al. [11] identify the problem as

$$X = (A, E), \quad f(X) = \|A\|_* + \lambda \|E\|_1, \quad \text{and} \quad h(X) = D - A - E, \tag{3}$$

and the Lagrangian function becomes

$$L(A, E, Y, \mu) = \|A\|_* + \lambda \|E\|_1 + \langle Y, D - A - E \rangle + \frac{\mu}{2} \|D - A - E\|_F^2. \tag{4}$$

Two ALM algorithms for solving the above formulation are proposed in [11]. Considering the balance
between processing speed and accuracy, robust PCA via the inexact ALM method is chosen
in our work. The matrix representation feature extraction process based on this approach
is summarized in Algorithm 2. In Algorithm 2, $J(D)$ is defined as the larger of $\|D\|_2$ and
$\lambda^{-1}\|D\|_\infty$, where $\|\cdot\|_\infty$ is the maximum absolute value of the matrix elements, and
$\mathcal{S}_{\varepsilon}[\cdot]$ is the soft-thresholding operator introduced in [11].

Fig. 1 shows the low-rank matrices recovered by applying robust PCA to the matrix form of
a typical laugh sound effect audio segment with and without corruption. The regularization
parameter is fixed at $\lambda = 1$. It can be seen that the matrices extracted by robust PCA are robust
to large errors and Gaussian noise. Ideally, the recovered low-rank matrices could be
used as features directly. However, in order to balance speed and performance, in this work
we transform the recovered low-rank matrices into MFCCs (mel-frequency cepstral coefficients)
matrices. All rows of the low-rank matrices are transformed into MFCCs independently. Fig. 2
shows the spectrograms of the signals in Fig. 1. The spectrograms of
the low-rank components vary little compared with the spectrograms of the corrupted signals.



Algorithm 2: Recovering the Low-Rank Component from Audio Segments via RPCA.
Input: $D \in \mathbb{R}^{m \times n}$ (matrix form of the audio segment).
Initialize: $Y_0 = D/J(D)$, $E_0 = 0$, $\mu_0 > 0$, $\rho > 1$, $k = 0$.
1: while not converged do
2:    // Lines 3-4 solve $A_{k+1} = \arg\min_A L(A, E_k, Y_k, \mu_k)$.
3:    $(U, S, V) = \mathrm{svd}(D - E_k + \mu_k^{-1} Y_k)$.
4:    $A_{k+1} = U \mathcal{S}_{\mu_k^{-1}}[S] V^T$.
5:    // Line 6 solves $E_{k+1} = \arg\min_E L(A_{k+1}, E, Y_k, \mu_k)$.
6:    $E_{k+1} = \mathcal{S}_{\lambda \mu_k^{-1}}[D - A_{k+1} + \mu_k^{-1} Y_k]$.
7:    $Y_{k+1} = Y_k + \mu_k (D - A_{k+1} - E_{k+1})$.
8:    Update $\mu_k$ to $\mu_{k+1}$.
9:    $k \leftarrow k + 1$.
10: end while
Output: $A \leftarrow A_k$.
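
The inexact ALM iteration listed above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the default values of `mu0` and `rho`, the geometric update $\mu_{k+1} = \rho \mu_k$, and the stopping rule on the relative residual are assumptions that the text does not specify. A nonzero input matrix is assumed.

```python
import numpy as np

def soft_threshold(x, eps):
    """Element-wise soft-thresholding operator S_eps[x]."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def rpca_inexact_alm(D, lam, mu0=None, rho=1.5, tol=1e-7, max_iter=500):
    """Split D into a low-rank part A and a sparse part E with D ~= A + E."""
    D = np.asarray(D, dtype=float)
    norm_two = np.linalg.norm(D, 2)                 # largest singular value of D
    J = max(norm_two, np.abs(D).max() / lam)        # J(D) as defined in the text
    Y = D / J                                       # Y_0
    E = np.zeros_like(D)                            # E_0
    mu = 1.25 / norm_two if mu0 is None else mu0    # assumed default for mu_0
    d_norm = np.linalg.norm(D, "fro")
    for _ in range(max_iter):
        # A-step: singular value thresholding of D - E + Y / mu.
        U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
        A = (U * soft_threshold(s, 1.0 / mu)) @ Vt
        # E-step: element-wise soft-thresholding.
        E = soft_threshold(D - A + Y / mu, lam / mu)
        # Multiplier and penalty updates.
        R = D - A - E
        Y = Y + mu * R
        mu = rho * mu                               # assumed geometric update
        if np.linalg.norm(R, "fro") <= tol * d_norm:
            break
    return A, E
```

A typical call would be `A, E = rpca_inexact_alm(D, lam=1.0)`, after which the rows of `A` can be passed on to the MFCC computation.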



Figure 1: Matrix form of audio segments with or without noise and extracted matrix features
via robust PCA with $\lambda = 1$ throughout. (a) Matrix form of a typical laugh sound effect audio
segment; (b) The low-rank component recovered from (a) via robust PCA; (c) Matrix form of
the same audio segment corrupted by white Gaussian noise with SNR=20dB; (d) The low-rank
component recovered from (c) via robust PCA; (e) Matrix form of the same audio segment
corrupted by white Gaussian noise and random large errors; (f) The low-rank component recovered
from (e) via robust PCA.

Figure 2: Spectrograms of low-rank approximations of the audio segments with or without noise
with $\lambda = 1$ throughout. (a) Spectrogram of the original laugh segment; (b) Spectrogram of
the low-rank approximation of the laugh segment; (c) Spectrogram of the same audio segment
corrupted by white Gaussian noise with SNR=20dB; (d) Spectrogram of the low-rank
approximation of the laugh segment with white Gaussian noise; (e) Spectrogram of the same audio
segment corrupted by white Gaussian noise and random large errors; (f) Spectrogram of the
low-rank approximation of the laugh segment with white Gaussian noise and random large errors.

3 Low Rank Matrix Classification
3.1 Notation and Problem Statement

Having extracted robust matrix representation features, the linear matrix classification approach
based on the trace norm regularization framework proposed in [12] is used to classify them. The
motivation for the trace norm regularization framework is twofold: a) the trace norm considers the
interaction among the frames in the matrix, whereas the simple approach of treating
the matrix as a long vector would lose this information; b) the trace norm is a suitable quantity
for measuring the complexity of the linear classifier. Generally, the problem of trace norm
regularization based matrix classification is formulated as

$$\min_{W,b} F_s(W,b) = f_s(W,b) + \lambda \|W\|_*, \tag{5}$$

where $W \in \mathbb{R}^{m \times n}$ is the unknown weight matrix, $b \in \mathbb{R}$ is the bias, $\|\cdot\|_*$ denotes the trace norm
defined as the sum of the singular values, and $\lambda$ is the regularization parameter. Here
$f_s(W,b) = \sum_{i=1}^{s} \ell(y_i, \mathrm{Tr}(W^T X_i) + b)$ is the empirical cost function induced by some convex smooth
loss function $\ell(\cdot,\cdot)$, where $\mathrm{Tr}(\cdot)$ denotes the trace, the subscript $s$ of $f_s(W,b)$ indicates the
number of training samples or the time step of the training procedure, whichever is apparent from
context, and $(X_i, y_i) \in \mathbb{R}^{m \times n} \times \mathbb{R}$ is the $i$th sample. In this work, the standard squared loss
function is used. Hence the empirical cost function becomes

$$f_s(W,b) = \sum_{i=1}^{s} \left(y_i - \mathrm{Tr}(W^T X_i) - b\right)^2.$$



3.2 APG Method for Matrix Classification 

Recently, Toh and Yun [13], Ji and Ye [14], and Liu et al. [15] independently proposed similar
algorithms that converge as $O(1/k^2)$ for problem (5) by using APG, where $k$ is the iteration counter.
The precondition for using the APG algorithm is that the loss function be smooth and convex,
and that its gradient satisfy the Lipschitz condition. Since $f_s(W,b)$ in this work is a composition
of a smooth convex function with an affine mapping, it is convex and smooth [18]. As for
Lipschitz continuity, it is shown in Theorem 1 that the gradient of $f_s(W,b)$ with respect to $W$,
denoted as

$$\nabla_W f_s(W,b) = -2 \sum_{i=1}^{s} \left(y_i - \mathrm{Tr}(W^T X_i) - b\right) X_i, \tag{6}$$

is Lipschitz continuous. Thus the APG method can be used to solve the matrix classification
problem. In order to solve the unconstrained convex optimization problem (5), APG approximates
$f_s(W,b)$ locally by a quadratic function with the bias fixed and solves

$$W_{k+1} = \arg\min_{W} Q(W, Z_k) \triangleq f_s(Z_k, b) + \frac{L}{2} \|W - Z_k\|_F^2 + \langle \nabla_W f_s(Z_k, b), W - Z_k \rangle + \lambda \|W\|_*, \tag{7}$$

which is assumed to be easy, to update the solution $W$. Based on the work of Nesterov [19][20],
Toh and Yun [13], Ji and Ye [14], and Liu et al. [15] showed that setting
$Z_k = W_k + \frac{t_{k-1} - 1}{t_k}(W_k - W_{k-1})$ for a sequence $t_k$ satisfying $t_{k+1}^2 - t_{k+1} \le t_k^2$ results in a
convergence rate of $O(1/k^2)$. Due to Theorem 1, the step size estimation in general APG
[13][14][15] is omitted, since we have an explicit Lipschitz constant. The APG approach for
batch-mode weight matrix learning is described in Algorithm 3.2. The $\mathcal{S}_{\varepsilon}[\cdot]$ in Algorithm 3.2 is
the soft-thresholding operator introduced in [11]:

$$\mathcal{S}_{\varepsilon}[x] = \begin{cases} x - \varepsilon, & \text{if } x > \varepsilon, \\ x + \varepsilon, & \text{if } x < -\varepsilon, \\ 0, & \text{otherwise,} \end{cases} \tag{8}$$

where $x \in \mathbb{R}$ and $\varepsilon > 0$. For vectors and matrices, this operator is extended by applying it
element-wise.
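
The soft-thresholding operator of Equation (8), and its application to the singular values of a matrix as in the step $U \mathcal{S}_{\varepsilon}[S] V^T$, can be sketched as follows. This is an illustrative NumPy sketch; the function names `soft_threshold` and `singular_value_threshold` are assumptions and do not come from [11].

```python
import numpy as np

def soft_threshold(x, eps):
    """Equation (8), applied element-wise to scalars, vectors, or matrices."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def singular_value_threshold(M, eps):
    """Compute U S_eps[S] V^T, where M = U S V^T is the thin SVD of M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, eps)) @ Vt

shrunk = soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], 1.0)
low_rank = singular_value_threshold(np.diag([3.0, 1.0]), 2.0)
```

Entries with magnitude at most `eps` are set to zero and the rest are moved toward zero by `eps`. Applied to singular values, this removes the small ones, which is why the thresholded matrix has reduced rank.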
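Algorithm 3.2: Batch-Mode Weight Matrix Learning via APG.
Initialize $W_0 = Z_1 \in \mathbb{R}^{m \times n}$, $\alpha_1 = 1$, $L = 2mn \sum_{i=1}^{s} \|X_i\|_F^2$, $\lambda$.
1: while not converged do
2:    $(U, S, V) = \mathrm{svd}\left(Z_k - \frac{1}{L}\left(-2 \sum_{i=1}^{s} (y_i - \mathrm{Tr}(Z_k^T X_i) - b_{k-1}) X_i\right)\right)$.
3:    $W_k = U \mathcal{S}_{\lambda/L}[S] V^T$.
4:    $\alpha_{k+1} = \frac{1 + \sqrt{1 + 4\alpha_k^2}}{2}$.
5:    $Z_{k+1} = W_k + \frac{\alpha_k - 1}{\alpha_{k+1}} (W_k - W_{k-1})$.
6:    $b_k = \frac{1}{s} \sum_{i=1}^{s} (y_i - \mathrm{Tr}(W_k^T X_i))$.
7:    $k \leftarrow k + 1$.
8: end while
Output: $W \leftarrow W_k$.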
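
A compact NumPy sketch of the batch-mode iteration above is given below for illustration. It is written under several assumptions that are not stated in the text: the weight matrix and the bias start at zero, a fixed number of iterations `n_iter` replaces the stopping criteria, and the helper names `svt`, `objective`, and `batch_apg` are hypothetical. The training matrices are assumed to be stored in an array `X` of shape `(s, m, n)`.

```python
import numpy as np

def svt(M, eps):
    """Soft-threshold the singular values of M by eps."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - eps, 0.0)) @ Vt

def objective(W, b, X, y, lam):
    """Squared loss plus lam times the trace norm of W."""
    r = y - np.einsum("imn,mn->i", X, W) - b
    return np.sum(r**2) + lam * np.linalg.norm(W, "nuc")

def batch_apg(X, y, lam, n_iter=200):
    s, m, n = X.shape
    L = 2.0 * m * n * np.sum(X**2)        # explicit Lipschitz constant
    W_prev = np.zeros((m, n))             # assumed zero initialization
    Z = W_prev.copy()
    b, alpha = 0.0, 1.0
    for _ in range(n_iter):
        # Gradient of the squared loss with respect to W, evaluated at Z.
        r = y - np.einsum("imn,mn->i", X, Z) - b
        grad = -2.0 * np.einsum("i,imn->mn", r, X)
        # Proximal step: singular value thresholding with threshold lam / L.
        W = svt(Z - grad / L, lam / L)
        # Momentum sequence and extrapolated point.
        alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha**2)) / 2.0
        Z = W + (alpha - 1.0) / alpha_next * (W - W_prev)
        # Closed-form bias update with W fixed.
        b = np.mean(y - np.einsum("imn,mn->i", X, W))
        W_prev, alpha = W, alpha_next
    return W, b
```

Because `L` is computed once from the data, no line search or step size estimation appears in the loop.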



The general APG algorithms [13][14][15] only provide methods for learning the weight matrix
and do not give updating rules for the bias. In order to update the bias $b$, we fix the weight
matrix $W_k$ and solve the following problem:

$$b_k = \arg\min_{b} \sum_{i=1}^{s} \left(y_i - \mathrm{Tr}(W_k^T X_i) - b\right)^2 + \lambda \|W_k\|_*, \tag{9}$$

which results in the bias updating rule

$$b_k = \frac{1}{s} \sum_{i=1}^{s} \left(y_i - \mathrm{Tr}(W_k^T X_i)\right). \tag{10}$$

This gives line 6 of Algorithm 3.2. For the stopping criteria of the iterations, we take
the following relative error conditions:

$$\|W_{k+1} - W_k\|_F / \|W_k\|_F < \varepsilon_1 \quad \text{and} \quad |b_{k+1} - b_k| / |b_k| < \varepsilon_2. \tag{11}$$

After the weight matrix $W$ and bias $b$ are found, the observed MFCCs matrix $X_i$ can be
classified via

$$y_i = \mathrm{Tr}(W^T X_i) + b. \tag{12}$$

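
The step from the bias sub-problem (9) to the updating rule (10) can be made explicit. The short derivation below is added for illustration and uses only quantities already defined. Since the term $\lambda \|W_k\|_*$ does not depend on $b$, setting the derivative of the squared loss with respect to $b$ to zero gives:

```latex
\begin{aligned}
\frac{\partial}{\partial b} \sum_{i=1}^{s} \bigl(y_i - \mathrm{Tr}(W_k^{T} X_i) - b\bigr)^2
  &= -2 \sum_{i=1}^{s} \bigl(y_i - \mathrm{Tr}(W_k^{T} X_i) - b\bigr) = 0 \\
\Longrightarrow \quad s\, b &= \sum_{i=1}^{s} \bigl(y_i - \mathrm{Tr}(W_k^{T} X_i)\bigr) \\
\Longrightarrow \quad b_k &= \frac{1}{s} \sum_{i=1}^{s} \bigl(y_i - \mathrm{Tr}(W_k^{T} X_i)\bigr).
\end{aligned}
```

The sub-problem is a convex quadratic in $b$, so this stationary point is its unique minimizer.
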
3.3 Determination of Lipschitz Constant 

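Because matrix classification is a special case of the general convex optimization problem, we
derive the Lipschitz constant in closed form, and hence the step size estimation [13][14] of the
general APG method is omitted in our approach. The determination of the Lipschitz constant is
given in the following theorem.

Theorem 1. $\nabla_W f_s(\cdot, b)$ is Lipschitz continuous with constant $L = 2mn \sum_{i=1}^{s} \|X_i\|_F^2$, i.e.,
$\forall U, V \in \mathbb{R}^{m \times n}$,

$$\|\nabla_W f_s(U,b) - \nabla_W f_s(V,b)\|_F \le L \|U - V\|_F, \tag{13}$$

where $\|\cdot\|_F$ denotes the Frobenius norm.

Proof. Substituting Equation (6), evaluated at $U$ and $V$, into the left-hand side of Equation (13),
we obtain

$$\begin{aligned}
\|\nabla_W f_s(U,b) - \nabla_W f_s(V,b)\|_F
&= \Bigl\| -2 \sum_{i=1}^{s} \bigl(y_i - \mathrm{Tr}(U^T X_i) - b\bigr) X_i + 2 \sum_{i=1}^{s} \bigl(y_i - \mathrm{Tr}(V^T X_i) - b\bigr) X_i \Bigr\|_F \\
&= 2 \Bigl\| \sum_{i=1}^{s} \mathrm{Tr}\bigl((U - V)^T X_i\bigr) X_i \Bigr\|_F \\
&\le 2 \sum_{i=1}^{s} \bigl|\mathrm{Tr}\bigl((U - V)^T X_i\bigr)\bigr| \, \|X_i\|_F \\
&\le 2mn \sum_{i=1}^{s} \|U - V\|_F \, \|X_i\|_F^2 \\
&= \Bigl(2mn \sum_{i=1}^{s} \|X_i\|_F^2\Bigr) \|U - V\|_F,
\end{aligned}$$

where in the last inequality, the easily verified fact that $|\mathrm{Tr}(A^T B)| \le \|A\|_1 \|B\|_1 \le mn \|A\|_F \|B\|_F$
for all $A, B \in \mathbb{R}^{m \times n}$ is used. Here $\|\cdot\|_1$ denotes the $\ell_1$ norm, which is the sum of the absolute values
of the matrix elements.

Thus the theorem is proved; that is, $\nabla_W f_s(\cdot, b)$ is Lipschitz continuous with constant
$L = 2mn \sum_{i=1}^{s} \|X_i\|_F^2$.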
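
The bound of Theorem 1 can be checked numerically on random data. The sketch below is only a sanity check under assumed dimensions and a fixed random seed; the helper name `grad_W` is hypothetical.

```python
import numpy as np

def grad_W(W, b, X, y):
    """Gradient of the squared loss with respect to W, as in Equation (6)."""
    r = y - np.einsum("imn,mn->i", X, W) - b
    return -2.0 * np.einsum("i,imn->mn", r, X)

rng = np.random.default_rng(0)
s, m, n = 20, 5, 4
X = rng.standard_normal((s, m, n))
y = rng.standard_normal(s)
U = rng.standard_normal((m, n))
V = rng.standard_normal((m, n))

L = 2.0 * m * n * np.sum(X**2)              # constant of Theorem 1
lhs = np.linalg.norm(grad_W(U, 0.3, X, y) - grad_W(V, 0.3, X, y))
rhs = L * np.linalg.norm(U - V)
print(lhs <= rhs)
```

Since the Cauchy-Schwarz inequality gives $|\mathrm{Tr}(A^T B)| \le \|A\|_F \|B\|_F$, the same check also passes without the factor $mn$; the constant of Theorem 1 is therefore valid but conservative.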

The APG based batch-mode weight learning method is effective for small training sets, but
with large training sets this classical optimization technique may become impractical in terms of
memory requirements. Furthermore, this method cannot efficiently deal with dynamic training
data that arrive as time sequences, as in audio and video processing. To address these limitations,
we propose an online learning framework in the following section.



4 Online Learning for Matrix Classification 
4.1 Online Learning with APG 

We present in this section the basic components of our online learning algorithm for matrix
classification, as well as a few minor variants that speed up our implementation in practice.

Our procedure is summarized in Algorithm 4.1. The $\otimes$ operator in step 6 of the algorithm
denotes the Kronecker product. Given two matrices $A \in \mathbb{R}^{m_1 \times n_1}$ and $B \in \mathbb{R}^{m_2 \times n_2}$, $A \otimes B$
denotes the Kronecker product between $A$ and $B$, defined as the matrix in $\mathbb{R}^{m_1 m_2 \times n_1 n_2}$ formed
by blocks of size $m_2 \times n_2$ equal to $A[i,j]B$. $\mathrm{GridTr}(Z_{k,t}, B_t)$ in step 13 denotes an operator
with inputs $Z_{k,t} \in \mathbb{R}^{m \times n}$ and $B_t \in \mathbb{R}^{mm \times nn}$ that returns a matrix in $\mathbb{R}^{m \times n}$ whose $(i,j)$th element is
the trace of the product between $Z_{k,t}^T$ and the $(i,j)$th $\mathbb{R}^{m \times n}$ block of $B_t$.

Assuming the training set is composed of i.i.d. samples of a distribution $p(X, y)$, the loop over $t$
draws one training sample $(X_t, y_t)$ at a time. This sample is first used to update the "past"
information $A_{t-1}$, $B_{t-1}$, $c_{t-1}$, and $D_{t-1}$. Then Algorithm 3.2 is applied to update the weight
matrix with the warm start $W_{t-1}$ obtained at the previous iteration. Since $F_t(W, b_{t-1})$ is relatively
close to $F_{t-1}(W, b_{t-1})$ for large values of $t$, so are $W_t$ and $W_{t-1}$ under suitable assumptions,
which makes it efficient to use $W_{t-1}$ as a warm restart for computing $W_t$.
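
To illustrate why the accumulated "past" information is sufficient, the sketch below builds $A_t$, $B_t$, and $D_t$ for a few random samples and compares the gradient assembled from them with the gradient of Equation (6) computed directly from the samples. The function name `grid_tr`, the dimensions, and the random data are assumptions made for the example.

```python
import numpy as np

def grid_tr(Z, B):
    """Return the m x n matrix whose (i, j)th entry is Tr(Z^T B_ij),
    where B_ij is the (i, j)th m x n block of the (m*m) x (n*n) matrix B."""
    m, n = Z.shape
    blocks = B.reshape(m, m, n, n)   # blocks[i, :, j, :] is the (i, j)th block
    return np.einsum("ipjq,pq->ij", blocks, Z)

rng = np.random.default_rng(0)
m, n, t = 3, 4, 5
X = rng.standard_normal((t, m, n))
y = rng.standard_normal(t)
Z = rng.standard_normal((m, n))
b = 0.7

# Accumulated statistics, as in the "past" information updates.
A = np.einsum("i,imn->mn", y, X)          # sum of y_i X_i
B = sum(np.kron(Xi, Xi) for Xi in X)      # sum of X_i (Kronecker) X_i
D = X.sum(axis=0)                         # sum of X_i

grad_from_stats = -2 * A + 2 * grid_tr(Z, B) + 2 * b * D
residual = y - np.einsum("imn,mn->i", X, Z) - b
grad_direct = -2 * np.einsum("i,imn->mn", residual, X)
print(np.allclose(grad_from_stats, grad_direct))
```

The two gradients agree because the $(i,j)$th block of $B_t$ equals $\sum_l X_l[i,j] X_l$, so the operator returns $\sum_l \mathrm{Tr}(Z^T X_l) X_l$ without access to the individual samples.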
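Algorithm 4.1: Online MC Learning Based on APG.
Initialize $W_0 \in \mathbb{R}^{m \times n}$, $b_0 \in \mathbb{R}$, $L_0 = 0$, $\lambda \in \mathbb{R}$.
1: $A_0 \in \mathbb{R}^{m \times n} \leftarrow 0$, $B_0 \in \mathbb{R}^{mm \times nn} \leftarrow 0$, $c_0 \in \mathbb{R} \leftarrow 0$, $D_0 \in \mathbb{R}^{m \times n} \leftarrow 0$ (reset the "past" information).
2: for $t = 1$ to $T$ do
3:    Draw training sample $(X_t, y_t)$ from $p(X, y)$.
4:    // Lines 5-9 update "past" information.
5:    $A_t \leftarrow A_{t-1} + y_t X_t$,
6:    $B_t \leftarrow B_{t-1} + X_t \otimes X_t$,
7:    $c_t \leftarrow c_{t-1} + y_t$,
8:    $D_t \leftarrow D_{t-1} + X_t$,
9:    $L_t \leftarrow L_{t-1} + 2mn \|X_t\|_F^2$.
10:   // Lines 11-19 update $W_t$ and $b_t$ using Algorithm 3.2 with $W_{t-1}$ and $b_{t-1}$ as warm restart.
11:   $W_{0,t} = Z_{1,t} = W_{t-1} \in \mathbb{R}^{m \times n}$, $b_{0,t} = b_{t-1}$, $\alpha_1 = 1$, $k = 1$.
12:   while not converged do
13:      $(U, S, V) = \mathrm{svd}\left(Z_{k,t} - \frac{1}{L_t}\left(-2A_t + 2\,\mathrm{GridTr}(Z_{k,t}, B_t) + 2 b_{k-1,t} D_t\right)\right)$.
14:      $W_{k,t} = U \mathcal{S}_{\lambda/L_t}[S] V^T$.
15:      $\alpha_{k+1} = \frac{1 + \sqrt{1 + 4\alpha_k^2}}{2}$.
16:      $Z_{k+1,t} = W_{k,t} + \frac{\alpha_k - 1}{\alpha_{k+1}} (W_{k,t} - W_{k-1,t})$.
17:      $b_{k,t} = \frac{1}{t}\left(c_t - \mathrm{Tr}(W_{k,t}^T D_t)\right)$.
18:      $k \leftarrow k + 1$.
19:   end while
20:   $W_t \leftarrow W_{k,t}$, $b_t \leftarrow b_{k,t}$.
21: end for
Output: $W \leftarrow W_T$, $b \leftarrow b_T$.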
4.2 Online Learning with Inexact APG

Algorithm 4.1 calls APG to update the weight matrix for each incoming sample by solving the
sub-problem with fixed bias

$$W_t = \arg\min_{W} \sum_{i=1}^{t} \left(y_i - \mathrm{Tr}(W^T X_i) - b_{t-1}\right)^2 + \lambda \|W\|_* \tag{14}$$

exactly, which causes a heavy computational load for large-scale training sets. Fortunately, due to
the closeness of consecutive weight matrices, we do not have to solve the sub-problem exactly.
Rather, updating $W_{t-1}$ once when solving this sub-problem is sufficient in practice. This leads to an
online MC learning method based on inexact APG, described in Algorithm 4.2.



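Algorithm 4.2: Online MC Learning with Inexact APG.
Initialize $W_0 \in \mathbb{R}^{m \times n}$, $b_0 \in \mathbb{R}$, $L_0 = 0$, $\lambda \in \mathbb{R}$.
1: $A_0 \in \mathbb{R}^{m \times n} \leftarrow 0$, $B_0 \in \mathbb{R}^{mm \times nn} \leftarrow 0$, $c_0 \in \mathbb{R} \leftarrow 0$, $D_0 \in \mathbb{R}^{m \times n} \leftarrow 0$ (reset the "past" information).
2: for $t = 1$ to $T$ do
3:    Draw training sample $(X_t, y_t)$ from $p(X, y)$.
4:    // Lines 5-9 update "past" information.
5:    $A_t \leftarrow A_{t-1} + y_t X_t$,
6:    $B_t \leftarrow B_{t-1} + X_t \otimes X_t$,
7:    $c_t \leftarrow c_{t-1} + y_t$,
8:    $D_t \leftarrow D_{t-1} + X_t$,
9:    $L_t \leftarrow L_{t-1} + 2mn \|X_t\|_F^2$.
10:   // Lines 11-16 compute $W_t$ using inexact APG, with $W_{t-1}$ as warm restart.
11:   $W_{0,t} = W_{t-1} \in \mathbb{R}^{m \times n}$.
12:   $(U, S, V) = \mathrm{svd}\left(W_{0,t} - \frac{1}{L_t}\left(-2A_t + 2\,\mathrm{GridTr}(W_{0,t}, B_t) + 2 b_{t-1} D_t\right)\right)$.
13:   $W_{1,t} = U \mathcal{S}_{\lambda/L_t}[S] V^T$.
14:   $(U, S, V) = \mathrm{svd}\left(W_{1,t} - \frac{1}{L_t}\left(-2A_t + 2\,\mathrm{GridTr}(W_{1,t}, B_t) + 2 b_{t-1} D_t\right)\right)$.
15:   $W_{2,t} = U \mathcal{S}_{\lambda/L_t}[S] V^T$.
16:   $W_t \leftarrow W_{2,t}$.
17:   // Line 18 updates the bias $b_t$.
18:   $b_t = \frac{1}{t}\left(c_t - \mathrm{Tr}(W_t^T D_t)\right)$.
19: end for
Output: $W \leftarrow W_T$, $b \leftarrow b_T$.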
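
The online procedure with inexact APG listed above can be sketched as a small Python class. This is an illustration under stated assumptions rather than the implementation evaluated in Section 5: the class name `OnlineInexactAPG`, the method names, and the zero initialization of the weight matrix and bias are assumptions. The sketch stores the full $mm \times nn$ matrix $B_t$, so it is only practical for small $m$ and $n$, and it assumes that the first sample is nonzero so that $L_t > 0$.

```python
import numpy as np

def svt(M, eps):
    """Soft-threshold the singular values of M by eps."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - eps, 0.0)) @ Vt

def grid_tr(Z, B):
    """Matrix whose (i, j)th entry is Tr(Z^T B_ij), B_ij being the (i, j)th m x n block of B."""
    m, n = Z.shape
    return np.einsum("ipjq,pq->ij", B.reshape(m, m, n, n), Z)

class OnlineInexactAPG:
    def __init__(self, m, n, lam):
        self.m, self.n, self.lam = m, n, lam
        self.W = np.zeros((m, n))          # W_0 (zero start is an assumption)
        self.b = 0.0                       # b_0 (zero start is an assumption)
        self.A = np.zeros((m, n))          # running sum of y_t X_t
        self.B = np.zeros((m * m, n * n))  # running sum of Kronecker products
        self.c = 0.0                       # running sum of y_t
        self.D = np.zeros((m, n))          # running sum of X_t
        self.L = 0.0                       # running Lipschitz constant
        self.t = 0

    def _prox_step(self, W):
        """One proximal gradient step using only the accumulated statistics."""
        grad = -2.0 * self.A + 2.0 * grid_tr(W, self.B) + 2.0 * self.b * self.D
        return svt(W - grad / self.L, self.lam / self.L)

    def partial_fit(self, X, y):
        """Process a single training sample (X, y)."""
        self.t += 1
        self.A += y * X
        self.B += np.kron(X, X)
        self.c += y
        self.D += X
        self.L += 2.0 * self.m * self.n * np.sum(X**2)
        # Two proximal steps from the warm start, as in lines 12-15.
        self.W = self._prox_step(self._prox_step(self.W))
        # Closed-form bias update, as in line 18.
        self.b = (self.c - np.sum(self.W * self.D)) / self.t
        return self

    def predict(self, X):
        return np.sum(self.W * X) + self.b
```

A model would be created once with `OnlineInexactAPG(m, n, lam)` and then fed samples one by one through `partial_fit`, so that no sample needs to be kept in memory after it has been processed.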



4.3 Online Learning with Mini-batch 

In some conditions, using the classical heuristic in gradient descent algorithms, we may also
improve the convergence speed of our algorithm by drawing $\mu > 1$ training samples at each iteration
instead of a single one. Let us denote by $(X_{t,1}, y_{t,1}), \ldots, (X_{t,\mu}, y_{t,\mu})$ the samples drawn at
iteration $t$. We can now replace lines 5 to 9 of Algorithms 4.1 and 4.2 by

$$\begin{aligned}
A_t &\leftarrow A_{t-1} + \sum_{i=1}^{\mu} y_{t,i} X_{t,i}, \\
B_t &\leftarrow B_{t-1} + \sum_{i=1}^{\mu} X_{t,i} \otimes X_{t,i}, \\
c_t &\leftarrow c_{t-1} + \sum_{i=1}^{\mu} y_{t,i}, \\
D_t &\leftarrow D_{t-1} + \sum_{i=1}^{\mu} X_{t,i}, \\
L_t &\leftarrow L_{t-1} + \sum_{i=1}^{\mu} 2mn \|X_{t,i}\|_F^2.
\end{aligned} \tag{15}$$

But in real applications, this batch method may not improve the overall convergence speed,
since computing the batch past information (Equation (15)) would occupy much of the
time. In particular, updating $B_t$ requires Kronecker products, which consume much of the computing
resources. If the computation cost of Equation (15) can be ignored or largely decreased, for
example by parallel computing, the batch method would increase the convergence speed by a
factor of $\mu$.

5 Experimental Validation 

5.1 Dataset 

Experiments are conducted on a collected database. We downloaded about 20 hours of videos from
Youku [21], covering different programs and different languages. The start and end positions of all
applause and laughter in the audio tracks were manually labeled. The database includes 800
segments of each sound effect. Each segment is about 3-8 s long, giving about 1 hour of data in
total for each sound effect. All audio recordings were converted to monaural wave format at a
sampling frequency of 8 kHz and quantized to 16 bits. Furthermore, the audio signals were
normalized to zero mean amplitude and unit variance in order to remove any
factors related to the recording conditions.

5.2 Online Learning 

In this section, we conduct detailed experiments to demonstrate the characteristics and merits
of online learning for the matrix classification problem. Five algorithms are compared: the
traditional batch algorithm with exact APG (APG); the online learning algorithm
with exact APG (OL_APG); the online learning algorithm with inexact APG (OL_IAPG); the
online learning algorithm with exact APG and update Equation (15) (OL_APG_Batch); and the
online learning algorithm with inexact APG and update Equation (15) (OL_IAPG_Batch). All
algorithms are run in Matlab on a personal computer with an Intel 3.40 GHz dual-core central
processing unit (CPU) and 2 GB of memory.

For this experiment, audio streams were windowed into a sequence of non-overlapping short-term
frames (20 ms long). 13-dimensional MFCCs including energy are extracted, and
50 adjacent frames (one second) of MFCCs form the MFCCs matrix feature. The goal is to
classify the matrices according to their labels. Two learning tasks are used to evaluate the
performance of the online learning method: laugh/non-laugh segment classifier learning
and applause/non-applause segment classifier learning. For the OL_APG and OL_APG_Batch
algorithms, the parameters in the stopping criteria (11) are set to $\varepsilon_1$ = 10~® and $\varepsilon_2$ = 10^® or smaller,
which is determined by the empirical evidence that larger values would make the algorithm
diverge. The regularization constant $\lambda$ is anchored by the large explicit fixed step size $L$ and the
matrices involved; this can be seen from $\lambda/L$ in line 3 of Algorithm 3.2, which means that in
practice the parameter $\lambda$ should be set adaptively with the step size $L$ in the online process. But
such a variation of $\lambda$ would make the comparisons between the algorithms meaningless.
Hence in this work we use $\lambda = 1$ throughout.
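
The grouping of per-frame MFCC vectors into one-second matrix features described above can be sketched as follows. The MFCC computation itself is not shown; the array `frames` is a placeholder for per-frame 13-dimensional MFCCs, and the function name `mfcc_matrices` and the choice to drop an incomplete trailing segment are assumptions.

```python
import numpy as np

def mfcc_matrices(mfcc_frames, frames_per_segment=50):
    """Group per-frame MFCC vectors, shape (num_frames, 13), into segment matrices."""
    mfcc_frames = np.asarray(mfcc_frames, dtype=float)
    n_seg = mfcc_frames.shape[0] // frames_per_segment
    usable = mfcc_frames[: n_seg * frames_per_segment]   # assumption: drop the tail
    return usable.reshape(n_seg, frames_per_segment, mfcc_frames.shape[1])

frames = np.zeros((175, 13))              # placeholder for 3.5 s of 20 ms frames
X = mfcc_matrices(frames)                 # one 50 x 13 matrix per second of audio
long_vectors = X.reshape(len(X), -1)      # vectorized form of each matrix
print(X.shape, long_vectors.shape)
```

Each matrix has 50 rows (frames) and 13 columns (coefficients); flattening it yields the 650-dimensional long vector used for the SVM comparison in Section 5.3.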

Fig. 3 compares the five algorithms. The proposed online algorithms draw samples
from the entire training set. We use a logarithmic scale for the computation time. Fig. 3(a)(c)
shows the values of the target functions as functions of time. It can be seen that the online
learning methods without batch or with small-batch past information updating converge faster than
the methods with large-batch past information updating; the reason for this has been explained
in the last paragraph of Section 4.3. After the online methods and batch methods converge, the
two kinds of methods achieve almost equal performance. Fig. 3(b)(d) shows the classification rates
of the different algorithms. In accordance with the values of the target functions, the
classification accuracies of the online methods without or with small-batch updating become stable
more quickly than those of the methods with batch updating. Although the inexact algorithms process
each sample much faster and with fewer resources than the exact ones, they converge slowly.

5.3 Robustness 

This section assesses the effectiveness of the low-rank matrix features extracted by robust PCA.
The original features (MFCCs_Matrix), the features corrupted with white Gaussian noise (WGN
SNR = 5 dB, 0 dB, -5 dB) and with 10%, 30%, 50% random large errors (LE 10%, 30%, 50%), and
the corresponding robust PCA extracted features (rPCA) are compared. In the comparisons, the
parameters in the stopping criteria (11) are set to $\varepsilon_1$ = 10~^ and $\varepsilon_2$ = 10~^, which are determined
by the same method as in Section 5.2. The regularization constant $\lambda$ is set to $1/\sqrt{50}$, which is a
classical normalization factor according to [22].

The classification accuracy on one-second audio segments is used to evaluate the performance
of the methods. Fig. 4 shows the performance of the methods with different matrix features
under different noise conditions as functions of the training time used in Algorithm 3.2.
It can be seen that the original MFCCs matrix feature is not robust to noise, especially random
large errors. If 10% of the elements of the MFCCs matrix feature are corrupted with random
large errors, there is generally a decrease of about 25% in audio segment classification
accuracy, whereas for the low-rank features extracted by robust PCA the decrease is about 5% on
average. For WGN, the robust PCA features also perform better than the original features, although
the difference is not as sharp as for large errors. The experiments show that the low-rank
components are more robust to noise and errors than the original features.
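
The two kinds of corruption used in this comparison can be simulated as in the following sketch. The text does not specify how the corrupted features were generated, so the details here are assumptions: the noise power is derived from the mean signal power, the corrupted entries are replaced (not perturbed) by values drawn uniformly from `[-magnitude, magnitude]`, and the function names and the `magnitude` parameter are hypothetical.

```python
import numpy as np

def add_white_noise(x, snr_db, rng):
    """Add white Gaussian noise so that the signal-to-noise ratio is about snr_db."""
    x = np.asarray(x, dtype=float)
    signal_power = np.mean(x**2)
    noise_power = signal_power / 10 ** (snr_db / 10)
    return x + rng.standard_normal(x.shape) * np.sqrt(noise_power)

def add_large_errors(x, fraction, magnitude, rng):
    """Replace a random fraction of the entries by large random values."""
    x = np.array(x, dtype=float)                       # copy, input is left untouched
    n_bad = int(round(fraction * x.size))
    idx = rng.choice(x.size, size=n_bad, replace=False)
    x.flat[idx] = rng.uniform(-magnitude, magnitude, size=n_bad)
    return x

rng = np.random.default_rng(0)
clean = np.ones((50, 13))                              # placeholder feature matrix
noisy = add_white_noise(clean, snr_db=0.0, rng=rng)
corrupted = add_large_errors(clean, fraction=0.1, magnitude=100.0, rng=rng)
```

With `fraction=0.1`, 65 of the 650 entries of a 50 x 13 matrix are replaced, which corresponds to the LE 10% condition.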

We also compare our method with the state-of-the-art SVM classifier with a long vector feature
(650 dimensions) obtained by vectorizing the matrix. The results are summarized in Table 1 and
Table 2 for applause/non-applause and laugh/non-laugh classification, respectively. The results
show that the SVM becomes useless under 5 dB white noise and 10% large corruptions, while
our method still works. With the low-rank component, however, the SVM performs better in some
situations, which is due to the robustness of the features.
Figure 3: Comparisons between various online learning methods; results are reported as
functions of learning time on a logarithmic scale. (a) Value of target function for online learning
of applause segments classifier; (b) Classification rate on audio segments of testing data for
online learning of applause segments classifier; (c) Value of target function for online learning of
laugh segments classifier; (d) Classification rate on audio segments of testing data for online
learning of laugh segments classifier.



Table 1: Performance comparison between our approach and SVM classification on long vector
method for applause/non-applause segment classification.

Approach                 Normal   SNR=-5dB  SNR=0dB  SNR=5dB  LE=10%   LE=30%   LE=50%
SVM+LV                   81.88%   64.07%    64.07%   64.07%   64.07%   64.07%   64.07%
APG+MFCCs_Matrix         82.76%   51.11%    55.87%   61.76%   52.78%   52.10%   51.16%
SVM+rPCA LV              81.88%   64.07%    64.07%   64.07%   81.77%   81.55%   81.43%
APG+rPCA MFCCs_Matrix    82.17%   54.44%    61.75%   70.47%   80.33%   76.22%   72.96%

Figure 4: (a) and (b): Comparisons of robust PCA extracted low-rank features and MFCCs
matrices in applause/non-applause segments classification. (c) and (d): Comparisons of robust
PCA extracted low-rank features and MFCCs matrices in laugh/non-laugh segments
classification.



Table 2: Performance comparison between our approach and SVM classification on long vector
method for laugh/non-laugh segment classification.

Approach                 Normal   SNR=-5dB  SNR=0dB  SNR=5dB  LE=10%   LE=30%   LE=50%
SVM+LV                   81.88%   60.01%    60.01%   60.01%   60.01%   60.01%   60.01%
APG+MFCCs_Matrix         90.02%   53.03%    63.64%   70.07%   54.30%   52.47%   52.59%
SVM+rPCA LV              75.06%   60.01%    60.01%   60.01%   74.81%   74.97%   74.56%
APG+rPCA MFCCs_Matrix    85.84%   54.36%    67.71%   76.97%   84.76%   80.24%   77.50%

6 Conclusions



In this work, we presented a novel framework based on trace norm minimization for audio segment
classification. The method unifies feature extraction and pattern classification in
the same framework. In this framework, the low-rank component of the original signal extracted by
robust PCA is more robust to noise and errors, especially to random large errors. We also
introduced online learning algorithms for matrix classification tasks. We obtained closed-form
updating rules for the weight matrix and the bias. We derived the explicit form of the Lipschitz
constant, which saves the computational burden of searching for the step size. Experiments show
that even when the percentage of original feature elements corrupted with random large errors is as
high as 50%, the performance of the robust PCA extracted features shows almost no decrease. In
future work, we plan to test this robust feature in other audio or speech processing applications
and to extend robust PCA, and trace norm minimization related methods more generally, from
matrices to the more general multi-way arrays (tensors). Some issues related to the learning methods
are also worth considering. For example, alternating between minimization with respect to the
weight matrix and the bias may result in fluctuation of the target value (even in batch mode), so
optimization algorithms that minimize jointly over the weight matrix and bias are required; for
multi-class classification problems with more classes, hierarchical methods may be introduced to
improve the classification accuracy.